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Abstract 



In this paper, we generalize the notions of centroids and barycenters to the broad class of 
information-theoretic distortion measures called Bregman divergences. Bregman divergences 
are versatile, and unify quadratic geometric distances with various statistical entropic measures. 
Because Bregman divergences are typically asymmetric, we consider both the left-sided and 
right-sided centroids and the symmetrized centroids, and prove that all three are unique. We 
give closed-form solutions for the sided centroids that are generalized means, and design a 
provably fast and efficient approximation algorithm for the symmetrized centroid based on 
its exact geometric characterization that requires solely to walk on the geodesic linking the 
two sided centroids. We report on our generic implementation for computing entropic centers 
of image clusters and entropic centers of multivariate normals, and compare our results with 
former ad-hoc methods. 



Keywords: Centroid, Bregman divergence, Legendre duality. 



Additional materials including C++ source codes, videos and Java applets available at 
http : //www . sonycsl . co . jp/person/nielsen/BregmanCen troids/| 



1 Introduction 



Content-based multimedia retrieval applications with their prominent image retrieval systems 
(CBIRs) are very popular nowadays with the broad availability of massive digital multimedia li- 
braries. CBIR systems spurred an intensive line of research for better ad-hoc feature extractions 
and effective yet accurate geometric clustering techniques. In a typical CBIR system [15], database 
images are processed offline during a preprocessing step by various feature extractors computing 
image characteristics such as color histograms. These features are aggregated into signature vectors 
that represent handles to images. Then given an online query image, the system first computes 
its signature, and search for the first, say h, best matches in the signature space. This requires 
to define an appropriate similarity measure between pairs of signatures. Designing an appropriate 
distance is tricky since the signature space is often heterogeneous (ie., cartesian product of feature 
spaces) and the usual Euclidean distance or L p -norms do not always make sense. For example, 
it is better to use the information-theoretic relative entropy, known as the Kullback-Leibler diver- 
gence, to measure the oriented distance between image histograms [15]. Efficiency is another key 
issue of CBIR systems since we do not want to compute the similarity measure (queryimage) for 
each image in the database. We rather want to prealably cluster the signatures efficiently dur- 
ing the preprocessing stage for fast retrieval of the best matches given query signature points. A 
first seminal work by Lloyd in 1957 [18] proposed the fe-means iterative clustering algorithm. In 
short, /c-means starts by choosing k seeds for cluster centers, associate to each point its "clos- 
est" cluster "center," update the various cluster centers, and reiterate until either convergence 
is met or the difference of the "loss function" between any two sucessive iterations goes below a 
prescribed threshold. Lloyd choosed the squared Euclidean distance since the minimum average 
intracluster distance yields centroids, the centers of mass of the respective clusters, and further 
proved that /c-means monotonically converges to a local optima. Cluster d's center a is defined by 
the minimization problem a = argmin c gC . ||cpj|| 2 = t^t X^eC, Pj"= ar g mm e AVG L 2(Cj, c), 
where \Ci\ denotes the cardinality of Cj. Half a century later, Banerjee et al. [4j showed that 
the /c-means algorithm extends to and only works for a broad family of distortion measures called 
Bregman divergences [8]. Bregman divergences Dp are parameterized families of distortion mea- 
sures that are defined by a strictly convex and differentiable generator function F : X — > IR + 
(with dim X = d) as Dp(p\\q) = F(p) — F(q)— < p — q,VF(q) >, where < •, • > denotes 
the inner product (< p,q >= Yli=iP^Q^ = P T l) an d VF(g) the gradient at point q (ie., 
\/F(q) = 9 qJ\] i dg(co )• Further, Teboulle [26] generalized this Bregman /c-means algorithm 
in 2007 by considering both hard and soft center-based clustering algorithms designed for both 
Bregman [8 j and Csiszar /-divergences [U 112] . The fundamental underlying primitive for these 
center-based clustering algorithms is to find the intrinsic best single representative of a cluster. 
As mentionned above, the centroid of a point set V = {p\, ...,p n } is defined as the optimizer of 
the minimum average distance: c = argmin c - Ylid(c,Pi). For oriented distance functions such as 
Bregman divergences that are not necessarily symmetric, we thus distinguish sided and symmetrized 
centroids as follows: = argmin ce ;r i YILi D F(Pi\\[c}), cf = argmin cG ^ ± YILi D F§c}\Pi), and 

c F = argmin ce ^- i Ya=i DF ~ " ^^'° F — . The first right-type and left-type centroids and 
c£ are called sided centroids, and the third type centroid c F is called the symmetrized Bregman 
centroid. Except for the class of generalized quadratic distances with generator Fq(x) = x T Qx, 
Sf(p',q) = ^Mw^glgM j s no f a Bregman divergence, see [20]. Since the three centroids 
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coincide with the center of mass for symmetric Bregman divergences, we consider in the re- 
mainder asymmetric Bregman divergences. We write for short AVG F {V\\c) = ± £2=1 D F(pi\\c), 
AVG F (c\\V) = I Er=i D F (c\\ Pi ) and AVG F (c; V) = \ Y? i=x S F (c; Pi ) , so that we get respectively 
= argmin cg ;t A\G F (V\ |c), c£ = argmin ce ^- AVG F (c\\V) and c F = argmin cg ^- AVG F (V; c). 
The symmetrized Kullback-Leibler |25[ [T9] and COSH centroids |10| [29] (symmetrized Itakura- 
Saito divergence obtained for F(x) = — logx, the Burg entropy) are certainly the most famous 
symmetrized Bregman centroids, widely used in image and sound processing. These symmetrized 
centroids play a fundamental role in applications that require to handle symmetric information- 
theoretic distances. 

1.1 Related work, contributions and paper organization 

Prior work in the literature is sparse and disparate. We summarize below main references that 
will be concisely revisited in section [2] under our notational conventions. Ben-Tal et al. [7j studied 
entropic means as the minimum average optimization for various distortion measures such as the 
/-divergences and Bregman divergences. Their study is limited to the sided left-type (generalized 
means) centroids. Basseville and Cardoso [6 compared in the 1-page paper the generalized/entropic 
mean values for two entropy-based classes of divergences: /-divergences [12] and Jensen-Shannon 
divergences [13]. The closest recent work to our study is Veldhuis' approximation method |27j for 
computing the symmetrical Kullback-Leibler centroid. 
We summarize our contributions as follows: 

• In section [2] we show that the two sided Bregman centroids and with respect to Bregman 
divergence D F are unique and easily obtained as generalized means for the identity and VF 
functions, respectively. We extend Sibson' s notion of information radius [21] for these sided 
centroids, and show that they are both equal to the .F-Jensen difference, a generalized Jensen- 
Shannon divergence [T7] also known as Burbea-Rao divergences [9]. 

• Section [3] proceeds by first showing how to reduce the symmetrized min AYG F (V; c) optimiza- 
tion problem into a simpler system that depends only on the two sided centroids and c^. 
We then geometrically characterize exactly the symmetrized centroid as the intersection point 
of the geodesic linking the sided centroids with a new type of divergence bisector: the mixed- 
type bisector. This yields a simple and efficient dichotomic search procedure that provably 
converges fast to the exact symmetrized Bregman centroid. 

• The symmetrized Kullback-Leibler divergence ( J-divergence) and symmetrized Itakura-Saito 
divergence (COSH distance) are often used in sound/image applications, where our fast 
geodesic dichotomic walk algorithm converging to the unique symmetrized Bregman centroid 
comes in handy over former complex adhoc methods |19| [TO] |25| [3] [23] . Section [4] considers 
applications of the generic geodesic- walk algorithm to two cases: 

— The symmetrized Kullback-Leibler for probability mass functions represented as d- 
dimensional points lying in the (d — l)-dimensional simplex S d . These discrete dis- 
tributions are handled as multinomials of the exponential families [20 with d— 1 degrees 
of freedom. We instantiate the generic geodesic-walk algorithm for that setting, show 
how it compares favorably with the prior convex optimization work of Veldhuis |27[ [3] , 
and validate formally experimental remarks of Veldhuis. 
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— The symmetrized Kullback-Leibler of multivariate normal distributions. We describe the 
geodesic-walk for this particular mixed-type exponential family of multivariate normals, 
and explain the Legendre mixed-type vector /matrix dual convex conjugates defining the 
corresponding Bregman divergences. This yields a simple, fast and elegant geometric 
method compared to the former overly complex method of Myrvoll and Soong [19] that 
relies on solving Riccati matrix equations. 



2 Sided Bregman centroids 
2.1 Right-type centroid 

We first prove that the right-type centroid is independent of the considered Bregman divergence 
Dp: cf(V) = p = \ 'Y^l=\'Pi is always the center of mass. Although this result is well-known 
in disguise in information geometry [2], it was again recently brought up to the attention of the 
machine learning community by Banerjee et al. [1] who proved that Lloyd's iterative fe-means 
"centroid" clustering algorithm |18] generalizes to the class of Bregman divergences. We state the 
result and give the proof for completeness and familizaring us with notations. 

Theorem 2.1 The right-type sided Bregman centroid of a set V of n points p\, ...p n , 
defined as the minimizer for the average right divergence = argmin c X^=i n^F{Pi\\c) = 
argmin c AVGir("P||c) ; is unique, independent of the selected divergence Dp, and coincides with 
the center of mass = cr = p = i Y^h=x Pi- 

Proof For a given point q, the right-type average divergence is defined as AVGFiVWq) = 
J27=i n^FijPiWl)- Expanding the terms Dp{pi\\qys using the definition of Bregman divergence, 
we get AYG F {V\\q) = £™ =1 \ {F{ Pi ) - F{q)- < Pi - q,VF(q) >). Subtracting and adding F(p) 
to the right-hand side yields 

AVG F (V,q) = (^^F(p i )-F(p)^ + ^F(p)-F(q)-^^<p i - 

= (E l ~ F ^) - F (P^j + ( F ® - F ^ - (it - 9). , 



l j^F{ Pl ) - F{p)\ +D F (p\\Q)- 
, i=i / 



Observe that since Y^l=i k F (P i )~ F (p) ^ s independent of q, minimizing AVG^('P||(/) is equivalent 
to minimizing Dp{p\\q). Using the fact that Bregman divergences Dp(p\\q) are non-negative, 
D F (p\\q) > 0, and equal to zero if and only if p = q, we conclude that = argmin g AVGF{V\\q) = 
p, namely the center of mass of the point set. The minimization remainder, representing the 
"information radius" (by generalizing the notion introduced by Sibson |24j for the relative entropy), 
is JSf(P) = ^ SILi F(Pi)~F(P) — 0) which bears the name of the F-Jensen differenc^] |9 . For F = 

1 In the paper [9], it is used for strictly concave function H — —F on a weight distribution vector it: Jn(pi, ■■■,Pn) = 
-ff(53"=i n iPi) ~ Y17=i KiH{pi). Here, we consider uniform weighting distribution ir = u (with -Ri — «). 
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—H = xlogx the negative Shannon entropy, Jf is known as the Jensen-Shannon divergence |17] : 
JS("P) = H(Y^l = iPi) — YH=i n^(^)- The Jensen-Shannon divergence is also known as half of the 
Jeffreys divergence (JD): JS(P; Q) = ^JD(P; Q), and can be interpreted as the expected information 
gain when discovering which probability distribution is drawn from (either P or Q). The Jensen- 
Shannon divergence can also be interpreted as the noisy channel capacity with two inputs giving 
output distributions P and Q Jensen-Shannon divergences are also useful for providing both 
lower and upper bounds for Bayes probability of error in decision problems [T7] . 

2.2 Dual divergence and left-type centroid 

Before characterizing the left-type sided Bregman centroid, we recall the fundamental duality 
of convex analysis: convex conjugation by Legendre transformation. We refer to [20] for de- 
tailed explanations that we concisely summarize here as follows: Any Bregman generator func- 
tion F admits a dual Bregman generator function G = F* via the Legendre transformation 
G(y) = sup xeX {< y,x > —F(x)}. The supremum is reached at the unique point where the 
gradient of G{x) =< y,x > —F{x) vanishes, that is when y = VF(x). Writing X' F for the 
gradient space {x' = VF(x)\x G X}, the convex conjugate G = F* of F is the function 
X' F C M. d — > M defined by F*(x') =< x,x' > —F(x). It follows from Legendre transformation 
that any Bregman divergence Dp admits a dual Bregman divergence Dp* related to Dp as follows: 
D F (p\\q) = F(p) + F*(VF(q))- <p,VF(q) >= F(p) + F*(q')- <p,q' >= D F *(q'\\p'). Using the 
convex conjugation twice, we get the following (dual) theorem for the left-type Bregman centroid: 

Theorem 2.2 The left-type sided Bregman centroid c£, defined as the minimizer for the av- 
erage left divergence = argmin ce ^ AVG£(c||"P), is the unique point c£ S X such that 
c£ = (VF)^ 1 ^/) = (VF) _1 (^™ =1 VF(pi)), where p' = (Vf) is the center of mass for the 
gradient point set Vf = {p[ = VF(p«) | pi G V}. 

Proof Using the dual Bregman divergence Dp* induced by the convex conjugate F* of F, we 
observe that the left-type centroid c£ = argmin ce ^ AVGf(c\\V) is obtained equivalently by mini- 
mizing the dual right-type centroid problem on the gradient point set: argmin^,, g ^ AVG f*(T , f'\\c'), 
where we recall that p' = VF(p) and Vf = {VF(pi), VF(p n )} denote the gradient point set. 
Thus the left-type Bregman centroid c£ is computed as the reciprocal gradient of the center of mass 
of the gradient point set eg* (TV) = £ £™ =1 VF( Pi ) : c F L = (V^-^ELi k VF (Pi)) = (VF)- 1 ^')- 
It follows that the left-type Bregman centroid is unique. 

Observe that the duality also proves that the information radius for the left-type centroid is the 
same F- Jensen difference (Jensen-Shannon divergence for the convex entropic function F). 

Corollary 2.3 The information radius equality AVG F (7 : '||c^) = PNGp^lWV) = JS^) = 
n Ym=i F(Pi) — F(P) > is the F- Jensen- Shannon divergence for the uniform weight distribution. 

2.3 Generalized means centers and barycenters 

We show that both sided centroids are generalized means also called quasi- arithmetic or /-means. 
We first recall the basic definition of generalized meang^] that generalizes the usual arithmetic and 

2 Studied independently in 1930 by Kolmogorov and Nagumo, see [22]. A more detailed account is given in [16] . 
Chapter 3. 
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geometric means. For a strictly continuous and monotonous function /, the generalized mean [22] 
of a sequence V of n real numbers V = {v\, v n } is defined as M(V; /) = / _1 (n Yli=i fi v i))- The 
generalized means include the Pythagoras' arithmetic, geometric, and harmonic means, obtained 
respectively for functions f(x) = x, f(x) = logx and f(x) = 1 (see appendix |a| . Note that since 
/ is injective, its reciprocal function f^ 1 is properly defined. Further, since / is monotonous, it is 
noticed that the generalized mean is necessarily bounded between the extremal set elements min, Vi 
and maxjfj: minj Xi < M(V;f) < maxiXj. In fact, finding these minimum and maximum set 
elements can be treated themselves as a special generalized power mean, another generalized mean 
for f(x) = x p in the limit case p — > ±00. 

These generalized means highlight a bijection: Bregman divergence Dp *-* VF-means. The one- 
to-one mapping holds because Bregman generator functions F are strictly convex and differentiable 
functions chosen up to an affine term [20]. This affine invariant property transposes to generalized 
means as an offset /scaling invariant property: M(S;f) = M(S;af + b) Va £ and \/b G M.. 
Although we have considered centroids for simplicity (ie., uniform weight distribution on the input 
set V), this approach generalizes straightforwardly to barycenters defined as solutions of minimum 
average optimization problems for arbitrary unit weight vector w (Vi, Wi>0 with \\w\\ = 1): 

Theorem 2.4 Bregman divergences are in bijection with generalized means. The right-type 
bary center bf>(w) is independent of F and computed as the weighted arithmetic mean on the 
point set, a generalized mean for the identity function: b^(V;w) = bji(V]w) = M(V;x;w) with 
M(V; f;w) = / _1 (^r=i w if( v i))- The left-type Bregman barycenter b F is computed as a gener- 
alized mean on the point set for the gradient function: b F (V) = M(V; VF; w). The information 
radius of sided barycenters is JSi?(7 7 ; w) = Yli=i w iF{Pi) ~ -^(X]f=i w iPi)- 

3 Symmetrized Bregman centroid 
3.1 Revisiting the optimization problem 

For asymmetric Bregman divergences, the symmetrized Bregman centroid is defined by the follow- 
ing optimization problem c F = arg min cg< y Y17=i Df ^ p '^ Df ^ p% ^ = argmin cg ^ AVG^; c). We 
simplify this optimization problem to another constant-size system relying only the right-type and 
left-type sided centroids, and c F , respectively. This will prove that the symmetrized Bregman 
centroid is uniquely defined as the zeroing argument of a sided centroid function by generalizing the 
approach of Veldhuis [27] that studied the special case of the symmetrized discrete Kullback-Leibler 
divergence, also known as J-divergence. 

Lemma 3.1 The symmetrized Bregman centroid c F is unique and obtained by minimizing 
mm g€ x D F (c^\\q) + D F (q\\cf i ): c F = argmin g6 ^ _D F (c£||g) + Dp{q\\c F ). 

Proof We have previously shown that the right-type average divergence can be rewritten as 
AVGF(V\\q) = (J22=i k^iPi) ~ F(P)) + Af(£>||<7)- Using Legendre transformation, we have 
similarly AYG F (q\ \V) = A\G F *(V F '\\q') = (£™ =1 ^Vi) " + D F ,{p' F \\q' F ). But 

D F *(p' F \\q F ) = D F **(VF* o VF(q)\\VF*(Y;i = lVF( Pi ))) = D F (q\\c F ) since F** = F, VF* = 
VF _1 and VF* o VF(q) = q from Legendre duality. Combining these two sum averages, 
it comes that minimizing argmin cg ^- | (AVG F (V\\q) + AVG F (q\\V)) boils down to minimizing 
arg mm q( zx Dp^^ \\ q) + D F (q\\c F ), after removing all terms independent of q. The solution is 
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unique since the optimization problem argmin gg ^- Dp(c R \\q) + Dp(q\\c F ) can be itself rewritten 
as argminqg^- D F *(VF(q)\\VF(c^)) + D F (q\ |c£), where V-F(g) is monotonous and D F (■]]■) and 
Df*(-||-) are both convex in the first argument (but not necessarily in the second). Therefore the 
optimization problem is convex and admits a unique solution. 



3.2 Geometric characterization 



We now characterize the exact geometric location of the symmetrized Bregman centroid by intro- 
ducing a new type of bisectoi[^] called the mixed-type bisector: 

Theorem 3.2 The symmetrized Bregman centroid c F is uniquely defined as the minimizer of 
D F (c R \\q) + D F (q\\c F ). It is defined geometrically as c F = Tp(c F >,c F ) H Mj?(c^, c F ), where 



{(VF) _1 ((l-A)VF(c£)-|-AVF(c£)) | A G [0, 1]} is the geodesic linking cf, to c F L , and 

= D F {x\\c F L )}. 



T F (c 

M F (c R ,c^) is the mixed-type Bregman bisector: M F (c R , c^) = {x S X \ D F (c 



-r\ 



Proof. First, let us prove by contradiction that q necessarily belongs to the geodesic T(c R ,c F ). 
Assume q does not belong to that geodesic and consider the point 
q± that is the Bregman perpendicular projection of q onto the (con- 
vex) geodesic [20]: q± = arg min tgr(c F c f\ D F (t\\q) as depicted in 
Using Bregman Pythagoras 7 theorenr\ twice (see |20|). 
we have: D F {c^\\q) > D F (c R \\q±) + D F (q±\\q) and D F {q\\c F ) > 



D F (q\\q ± ) + D F (q ± \ 
D F {c F R \\ qi _) + D F (q 



\c[) 



Thus, we get D F (c F \\q) + D F {q\\c F ) 



> 



|c£) + (D F (q ± \\q) + D F (q\\q ± )). But since 



D F (q±\\q) + D F (q\\q±) > 0, we reach the contradiction since 




D F {c R \\q ± ) + D F (q A 



< D F (c R \\q) + D F (q\\c{). Therefore 



Figure 1: The symmetrized 
Bregman centroid necessarily 
lies on the geodesic passing 
through the two sided cen- 
troids c R and c F . 



q necessarily belongs to the geodesic T(c R ,c F ). Second, let us 
show that q necessarily belongs to the mixed- type bisector. As- 
sume it is not the case. Then D F (c R \\q) ^ D F (q\\c F ) and sup- 
pose without loss of generality that Dp(c R \\q) > D F (q\\c F ). Let 
A = Dp^Wq) - D F (q\\c F ) > and l Q = D F (q\\c[) so that 
D F (c R \\q) + D F (q\\c F ) = 21q + A. Now move q on the geodesic 

towards c R by an amount such that = Dp(q\\c F ) < Iq + 5 A. Clearly, D F (c R \\q) < Iq and 
Dp(c R \\q) + D F (q\\c F ) < 21q + ^A contradicting the fact that q was not on the mixed-type bi- 
sector. 

The equation of the mixed-type bisector Mp(p, q) is neither linear in x nor in x' = V-F(x) (nor in 
x, = (x, x')) because of the term F(x), and can thus only be manipulated implicitly in the remainder: 
M F (p,q) = {x£ X \ F(p)-F{q)-2F{x)- <p,x' > + < x,x' > + < x,q' > - < q,q' >= 0}. The 
mixed- type bisector is not necessarily connected (eg., extended Kullback-Leibler divergence), and 
yields the full space X for symmetric Bregman divergences (ie., generalized quadratic distances). 

Using the fact that the symmetrized Bregman centroid necessarily lies on the geodesic linking 
the two sided centroids c R and cf, we get the following corollary: 



3 See |20j for the affine/curved and symmetrized bisectors studied in the context of Bregman Voronoi diagrams. 

4 Bregman Pythagoras' theorem is also called the generalized Pythagoras' theorem, and is stated as follows: 
Df{v\\<1) > D{p\\Pn{q)) + DF(Pfi(q)\\q) where Pn(q) = argmin^gn DfMI?) is the Bregman projection of q onto a 
convex set Q, see [4]. 
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(a) 


(b)- 




eg = (0.47, 0.78), eg = (0.25,0.76), AVG F (V\\c^ L ) = 4.29 


eg = (0.43, 0.11), 


eg = (0.13,0.07), AVG F (-P![cg_ L ) = 22.70 


c F = (0.35,0.77), ,AVG F (V;c F ) = 3.96 


c F = (0.24, 0.09), AVG f (P;c F ) = 16.91 



Figure 2: Bregman centroids for (a) the extended Kullback-Leibler and (b) Itakura-Saito diver- 
gences on the open square X =]0, 1[ 2 . Right-sided and left-sided, and symmetrized centroids are 
displayed respectively as red, blue and purple points. The geodesic linking the right-sided centroid 
to the left-sided one is shown in grey, and the mixed-type bisector is displayed in purple. 

Corollary 3.3 The symmetrized Bregman divergence minimization problem is both lower and up- 
per bounded as follows: JS F {T) < A\G F {V;c F ) < D F (c%\\c%). 

Figure [2] displays the mixed-type bisector, and sided and symmetrized Bregman centroids for 
the extended^ Kullback-Leibler (eKL) and Itakura-Saito (IS) divergences. 

3.3 A simple geodesic-walk dichotomic approximation algorithm 

The exact geometric characterization of the symmetrized Bregman centroid provides us a simple 
method to approximately converge to c F : Namely, we perform a dichotomic walk on the geodesic 
linking the sided centroids and c F . This dichotomic search yields a novel efficient algorithm 
that enables us to solve for arbitrary symmetrized Bregman centroids, beyond the former Kullback- 
Leibler casa^Jof Veldhuis [27]: We initially consider A £ [A m = 0, Xm = 1] and repeat the following 
steps until Xm — X m < e, for e > a prescribed precision threshold: 

Geodesic walk. Compute interval midpoint A/j = Am + AM and corresponding geodesic point 

q h = (VFr 1 ((l-X h )VF(4) + X h VF(c 1 [)), 

Mixed-type bisector side. Evaluate the sign of Df(c^| \qh) ~ Dp{qh\\c^), and 

Dichotomy. Branch on [Xh, Xm] if the sign is negative, or on [A m , A^] otherwise. 

Note that any point on the geodesic (including the midpoint qi) or on the mixed-type bi- 

2 

sector provides an upperbound AVGi7("P; qt) on the minimization task. Although it was noted 
experimentally by Veldhuis [27] for the Kullback-Leibler divergence that this midpoint provides 

5 We relax the probability distributions to belong to the positive orthant (ie., unnormalized probability mass 
function) instead of the open simplex S d . 

6 Veldhuis' method [27] is based on the general purpose Lagrangian multiplier method with a normalization step. 
It requires to set up one threshold for the outer loop and two prescribed thresholds for the inner loops. For example, 
Aradilla et al. [3] set the number of steps of the outer loop and inner loops to ten and five iterations each, respectively. 
Appendix |B] provides a synopsis of Veldhuis' method. 
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"experimentally" a good approximation, let us emphasize that is not true in general, as depicted 
in Figure ^lb) for the Itakura-Saito divergence. 




Theorem 3.4 The symmetrized Bregman centroid can be approximated within a prescribed pre- 
cision by a simple dichotomic walk on the geodesic r(c^,c^) helped by the mixed-type bisector 
Mp(c^,c^). In general, symmetrized Bregman centroids do not admit closed-form solutions. 

In practice, we can control the stopping criterion e by taking the difference Wf(q) = Dp(c^\ \q) — 
Dp{q\\c L ]) between two successive iterations since it monotonically decreases. The number of it- 
erations can also be theoretically upper-bounded as a function of e using the maximum value of 
the Hessian hp = max xgr ^ c F C F) ||i/^(x)|| 2 along the geodesic r(c^,c£) by mimicking the analysis 
in [21] (See Lemma 3 of [21 J. 

4 Applications of the dichotomic geodesic-walk algorithm 

4.1 Revisiting the centroid of symmetrized Kullback-Leibler divergence 

Consider a random variable Q on d events f2 = {fii, O^}, called the sample space. Its associated 
discrete distribution q (with Pr(Q = fij) = Q®) belongs to the topological^/ open (d— l)-dimensional 
probability simplex S d of W^: Yli=i = 1 an d G % > 0. Distributions q arise 

often in practice from image intensity histograms^} To measure the distance between two discrete 
distributions p and q, we use the Kullback-Leibler divergence also known as relative entropy or 
discrimination information: KL(p||g) = Yli=i ~(T) ■ Note that this information measure is 

unbounded whenever there exists = for a non-zero qW > 0. But since we assumed that 
both p and q belongs to the open probability simplex S d , this case does not occur in our setting: 
< KL(p| \q) < oo with left-hand side equality if and only if p = q. The symmetrized KL divergence 
|(KL(p||</) +KL(g||p)) is also called J-divergence or SKL divergence, for short. 

The random variable Q can also be interpreted as a regular exponential family member [20] in 
statistics of order d — 1, generalizing the Bernoulli random variable. Namely, Q is a multinomial 
random variable indexed by a {d — l)-dimensional parameter vector 9 q . These multinomial distribu- 
tions belong to the broad class of exponential families [20_ in statistics for which have the important 
property that KL(p(9 p )\\q(9 g )) = Dp(9 g \\9 p ), see [2D]. That is, this property allows us to bypass the 
fastidious integral computations of Kullback-Leibler divergences and replace it by a simple gradient 
derivatives for probability distributions belonging to the same exponential families. From the canon- 
ical decomposition exp(< 9,t(x) > —F(0) + C(x)) of exponential families [20J, it comes out that the 

t -\ (*) (*) 

natural parameters associated with the sufficient statistics t(x) are = log = log — ... 

since q^ = 1 — X^j=i 1^- The natural parameter space is the topologically open K 6 ^ 1 . The log 
normalizer is F{9) = log(l + Yli=i expfl®), called the multivariate logistic entropy. It follows that 
the gradient is X7F(9) = r] = (r/j)j with rji = ^—^^r~ > q) an d yields the dual parameterization of 

the expectation parameters: r\ = VqF(9). The expectation parameters play an important role in 
practice for infering the distributions from identically and independently distributed observations 

7 To ensure to all bins of the histograms are non-void, we add a small quantity e to each bin, and normalize to 
unit. This is the same as considering the random variable Q + eU where U is a unit random variable. 
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x%, x n . Indeed, the maximum likelihood estimator of exponential families is simply given by the 
center of mass of the sufficient statistics computed on the observations: fj = - Ya=i see P3- 

Observe in this case that the log normalizer function is not separable (F(x) ^ Ef=i fi( x ^))- The 
function F and F* = J V -1 -F are convex conjugates obtained by the Legendre transformation that 
maps both domains and functions (Xp,F) < — ► (Xf* , F* ) . We get the inverse V _1 F = (VF)^ 1 of 

the gradient \7F as V _1 i ? (r/) = ^log ^ ■^yF 1 u) ^ = ®' T nus it comes that the Legendre convex 

conjugate is F*(r]) = fef=i V® log??®) + (1 - Eti V®) log(l - Ef=iV l) ), the ^-ar?/ entropy. 
Observe that for d = 2, this yields the usual bit entropjj^] function F*(rj) = r/logry+(l — 77) log(l — 77). 

To convert back from the multinomial (d — l)-order natural parameters 9 to discrete <i-bin 
normalized probability mass functions (eg., histograms) A E we use the following mapping: 
qW = ^ d rr and g« = "fP eW — — for all i E {1, This gives a wo/id lie., 

normalized) distribution q £ S d for any 9 E Note that the coefficients in 6 may be either 

positive or negative depending on the ratio of the probability of the ith event with the last one, 

As mentioned above, it turns out that the Kullback-Leibler measure can be computed from the 
Bregman divergence associated to the multinomial by swapping arguments: KL(p||g) = DF(6 q \\8 p ), 
where the Bregman divergence DF(8 g \\6 p ) = F{0 q ) — F(6 P )— < 9 q — 6 p ,VF(0 p ) > is defined for 
the strictly convex (V 2 F > 0) and diffentiable log normalizer F(9) = log(l + Ei=i ex P 

0W). We 

implemented the geodesic-walk approximation algorithm for that context, and observed in practice 
that the SKL centroid deviates much (20% or more in information radius) from the "middle" point 
of the geodesic (A = g), thus reflecting the asymmetry of the underlying space. Further, note that 
our geodesic- walk algorithm proves the empirical remark of Veldhuis [27] that "... the assumption 
that the SKL centroid is a linear combination of the arithmetic and normalized geometric mean 
must be rejected." Appendix [B] displays side by side Veldhuis' and the geodesic-walk methods 
for reference, and appendix [C] report on the sided and symmetrized Bregman centroids of two 
probability mass functions obtained from intensity histograms of apple images. Observe that the 
symmetrized centroid distribution may be above both source distributions, but this is never the case 
in the natural parameter domain since the two sided centroids are generalized means, and that the 
symmetrized centroid belongs to the geodesic linking these two centroids (ie., a barycenter mean 
of the two sided centroids). 

Computing the centroid of a set of image histograms, a center robust to outliers, allows one 
to design novel applications in information retrieval and image processing. For example, we can 
perform simultaneous contrast image enhancement by first computing the histogram centroid of a 
group of pictures, and then performing histogram normalization to that same reference histogram. 

4.2 Entropic means of multivariate normal distributions 

The probability density function of an arbitary d-variate normal N"(fJ>, S) with mean fi and variance- 
covariance matrix S is given by Pr(A" = x) = p(x;fi,T,) = j 1 exp ( — ^ x ^ ^ — — M M . 

(2tt)2 VdetE V / 

It is certainly the engineer's favorite family of distributions that nevertheless becomes intricate 



This generalizes the ID case of Kullback-Leibler's Bernoulli divergence: F(x) = log(l + exp a;) is the logistic 

cxp X 
1+cxp X 



entropy, F'(x) — x and F' 1 = log j^., and F*(x) = 2; log x + (1 — x) log(l — x), is the dual bit entropy. 
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to use as dimension goes beyond 3D. The density function can be rewritten into the canoni- 
cal decomposition to yield an exponential family of order D = f^tj) (the mean vector and 
the positive definite matrix E -1 accounting respectively for d and d ^ d ^ parameters). The 
sufficient statistics is stacked onto a two-part D-dimensional vector x = (x,—^xx T ) associ- 
ated with the natural parameter = (9,0) = Accordingly, the source pa- 
rameter are denoted by A = (/x, E). The log normalizer specifying the exponential family is 
F(0) = \Tt(Q- 1 99 t ) - ^logdete + ^logvr (see HH [2]). To compute the Kullback-Leibler di- 
vergence of two normal distributions N p = AA(/x p ,E p ) and N q = A/"(/i 9 , S g ), we use the Bregman 
divergence as follows: KL(N p \\N q ) = D F (@ q \\@ p ) = F{@ q ) - F(@ p )- < (0, - @ p ),VF(@ p ) >. 
The inner product < P , Q q > is a composite inner product obtained as the sum of inner products 
of vectors and matrices: < p ,0 g >=< p ,0 g > + < 9 p ,9 q >. For matrices, the inner product 
< p ,0 g > is defined by the trace of the matrix product p 0j: < p ,0 g >= Tr(0 p 0j). In 
this setting, however, computing the gradient, inverse gradient and finding the Legendre convex 
conjugates are quite involved operations. Yoshizawa and Tanabe [30] investigated in a unifying 
framework the differential geometries of the families of probability distributions of arbitrary mul- 
tivariate normals from both the viewpoint of Riemannian geometry relying on the corresponding 
Fisher information metric, and from the viewpoint of Kullback-Leibler information, yielding the 
classic torsion- free flat shape geometry with dual affine connections [2] . Yoshizawa and Tanabe [30] 
carried out computations that yield the dual natural/expectation coordinate systems arising from 
the canonical decompotion of the density function p(x; ji, S): 

V = M \ . , a ( A = n \ a x £ _ / 9 = E^/i 



The strictly convex and differentiable dual Bregman generator functions (ie., potential functions 
in information geometry) are F(&) = \Tx(Q- l 99 T ) - \ log det© + | log vr, and F*(H) = -\ log(l + 
ri T H~ lr ri) — ^ log det(— H) — | log(27re) defined respectively both on the topologically open space M. d x 
Cone^". Note that removing constant terms does not change the Bregman divergences. The H 44> 
coordinate transformations obtained from the Legendre transformation (with (V-F) -1 = VF*) are 

given by H - V S F(9) _ ( ^« ) _ ( ^ |^ (e - lflT ) = ( _ (E+ %„ r) ) 

and e = V^<*) = ( W» ) = ( ) = ( pt )■ Tnese tania 

simplifies significantly when we restrict ourselves to diagonal-only variance-covariance matrices Ej, 
spherical normals Ej = ail, or univariate normals J\f{m, ai). 

Computing the symmetrized Kullback-Leibler centroid of a set of normals (Gaussians) is an 
essential operation for clustering sets of multivariate normal distributions using center-based k- 
means algorithm |14[ 126]. Myrvoll and Soong [19^ described the use of multivariate normal clustering 
in automatic speech recognition. They derived a numerical local algorithm for computing the 
multivariate normal centroid by solving iteratively Riccati matrix equations, initializing the solution 
to the so-called "expectation centroid" |23| . Their method is a complex and costly since it also 
involves solving for eigensystems. In comparison, our geometric geodesic dichotomic walk procedure 
for computing the entropic centroid, a Bregman symmetrized centroid, yields an extremely fast and 
simple algorithm with guaranteed performance. 
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A Dominance relationships of sided centroid coordinates 



The table below illustrates the bijection between Bregman divergences and generalized /-means for 
the Pythagoras' means (ie., extend to separable Bregman divergences): 



Bregman divergence Dp 


F < 


► f = F> / _1 = (F') _1 


/-mean 
(Generalized means) 


Squared Euclidean distance 


2 X ' 


X X 


Arithmetic mean 


(half squared loss) 








Kullback-Leibler divergence 


x log X — X < 


log x exp x 


Geometric mean 


(Ext. neg. Shannon entropy) 






(IE=i*;)- 


Itakura-Saito divergence 


— log x < 


X X 


Harmonic mean 


(Burg entropy) 






n 

Tr^n 1 

^J = l Xj 



We give a characterization of the coordinates * of the right-type average centroid (center of 

mass) with respect to those of the left-type average centroid, the c F ^ coordinates. 
Corollary 

Provided that VF is convex (e.g., Kullback-Leibler divergence), we have 

c Fd) > c Fd) for all 

i G {1, ...,d}. Similarly, for concave gradient function (e.g., exponential loss), we have < c F ^ 
for all i G { 1 , . . . , d} . 

Proof Assume VF is convex and apply Jensen's inequality to - X^=i VF(pj). Consider for 
simplicity without loss of generality ID functions. We have ^ Y17=i ^F{Pi) — ^-^(nSiLi^*)- 
Because (VF) _1 is a monotonous function, we get c£ = (VF)~ 1 (^ Y^i=i ^F(Pi)) — 

(VF)- X (VF(I E?=iPi)) = \ EtiPi = 4- Thus we conclude that c F R (t) > c F L (t) Vz G {1, d] for 
convex VF (proof performed coordinatewise) . For concave VF functions (i.e., dual divergences of 
VF-convex primal divergences), we simply reverse the inequality (e.g., the exponential loss dual of 
the Kullback-Leibler divergence). 

Note that Bregman divergences Dp may neither have their gradient VF convex nor concave. 
The bit entropy F{x) = x log x + (1 — x) log(l — x) yielding the logistic loss Dp is such an example. 
In that case, we cannot a priori order the coordinates of and c F . 

This dominance relationship can be verified for the plot in natural parameter space of Ap- 
pendix C. 
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B Synopsis of Veldhuis' and the generic geodesic-walk methods 



The table below provides a side-by-side comparison of Veldhuis' J-divergence centroid convex pro- 
gramming method [27] with our generic symmetrized Bregman centroid (entropic means) geodesic- 
walk instantiated for the Kullback-Leibler divergence. 



Veldhuis' algorithm 



Geodesic-walk algorithm 



INPUT: 

n discrete distributions (ft, ... : g n of with 
Vi £{!,..., n) q i = (qf\...,q\ d] ). 

INITIALIZATION 

Arithmetic mean: 

vfc s« = i Eti it 1 

Geometric normalized mean: 
Vfc gM = with Vfc g<*> = 

a - -I 



MAIN LOOP: 
For 1 to 10 

vfc »w = 

» ,1*' e 

Vfc it*) = 1 



(nr=i< 



INNER LOOP 1: 
For 1 to 5 

Vfc z<*) <- i<*) - '■- 



INNER LOOP 2: 
For 1 to 5 

cv i — ry — ' -11 



CENTROID: 

VfccC 1 ' = ^OgOexpa 



INPUT: 

n discrete distributions , q n of (S d with 

V«€{1 n}*^,} 1 ',....^) 

CONVERSION: 

Probability mass ftinction — > multinomial 



i Yfc = 



F(<?) =log(l + £ J ti 1 exp(?")) 

ie{l,....d-l} 



VF(f?) = 



(VF)- 1 '" 1 = flog 



E{1 d-1} 



INITIALIZATION: 
Arithmetic mean: 9^ = J^ILi 
VF-mean: flf = VF+I^! VF{0;)) 
A m = 0, A M = 1 



GEODESIC DICHOTOMIC WALK: 
While Ajyj — A m > precision do 

9 = (VF)-!((1 - A)VF(cg) + AVF(cf)] 
if £>p(e£ ||fi) > D F {0||c£) then 

A M = A 
else 
A„, = A 



CONVERSION: 

Multinomial — > Probability mass function 



Vi Vfc «f > = 



1+ S!=i< 1+ "' p9 ") 



Both C++ source codes with cross-check validations are available at 



http : //www. sonycsl . co . jp/person/nielsen/BregmanCentroids/ 
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C Image histogram centroids with respect to the relative entropy 

The plots below show the Kullback-Leibler sided and symmetrized centroids on two distributions 
taken as the intensity histograms of the apple images shown below. Observe that the symmetrized 
centroid distribution is above both source distributions for intensity range [100 — 145], but this is 
never the case in the natural parameter space due to the property of generalized means. 





100 



150 



Centroid theta (natural parameter space) 



Right centroid - 
Left centroid - 
Symmetrized centroid - 
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D Entropic sided and symmetrized centroids of bivariate normal 
distributions 



We report on our implementation for multivariate normal distributions below. Observe that the 
right-type Kullback-Leibler centroid is a left-type Bregman centroid for the log normalizer of the 
exponential family. Our method allowed us to verify that the simple generalized V-F-mean formula 
c£CP) = (VF) -1 Qr^ =1 ^VF(k)) coincides with that of the NIPS*06 paper [H]. Furthermore, we 
would like to stress out that our method extends to arbitrary entropic centroids of members of the 
same exponential family. 

The figure below plots the entropic right- and left-sided and the symmetrized centroids in red, 
blue and green respectively for a set that consists of two bivariate normals (D = rf ( rf + 3 ) = 5). The 
geodesic midpoint interpolant (obtained for A = |) is very close to the symmetrized centroid, and 
shown in magenta. 



m = 

So = 



(0.34029138065736869, 0.26130947813348798), 

0.43668091668767117 -0.42663095837289156 1 
-0.42663095837289161 0.63899446830332574) J 
mi = (0.95591075380718404,0.6544489172032838), 

0.79712692342719804 -0.033060250957646142 
0.033060250957646142 0.14609813043797121 




Sr ~- 



sf = 



Si = 



(0.29050997932657774,0.53527112890397821), 

0.33728018979019664 -0.13844874409795613 
-0.13844874409795613 0.2321103610207193 
(0.64810106723227623,0.45787919766838603), 

0.71165072320677747 -0.16933954090511438 
-0.16933954090511441 0.43118595400867693 

(0.42475123207621085, 0.5062178606510539), 

0.50780328118070528 -0.15653432651371618 
-0.15653432651371618 0.30824860232457035 
(0.46930552327942698, 0.49657516328618234), 



0.55643330303588234 
-0.1608128087229499 



-0.16081280872294987 
0.33314553526979185 



Information radius: 



• right, loft: 0.83419372149741644 

• symmetrized: 0.64099815325721565 

• geodesic A = i: 0.6525069280087431 



We give other pictorial results below for n = 2 and n = 10 bivariate normals, respectively. 




16 



